---
title: "Drivers of social equity in renewable energy at the municipal level: The case of local Japanese energy policy and preferences"
author: "Timothy Fraser & Andrew Chapman"
subtitle: "Modeling, Descriptive Statistics, and Visualization"
output: 
  html_document:
  pdf_document:
    toc: TRUE
---


# 1. Introduction

This document describes and records the code used in our paper, "Drivers of social equity in renewable energy at the municipal level: The case of local Japanese energy policy and preferences." Below, we outline the questions we aim to answer, approach, and R code used to answer our questions, one by one.


# 1.1 Load Packages

```{r, message = FALSE, warning=FALSE}
# Load packages
require(tidyverse)
require(readxl)
require(Amelia)
require(distributions3)
require(gtools)
require(car)
require(scales)
require(lmtest)
require(modelr)
require(purrr)
set.seed(12345)
```


# 1.2 Import Data

First, we import data. The excel file *"population_data.xlsx"* contains data for all 1741 Japanese municipalities, while the excel file *"sampling_frame.xlsx"* contains information on each town in the sampling frame. It also contains responses for the actual towns which responded to our survey. Finally, *"regression_table.xlsx"* contains the data for the 47 final towns in our analysis, which we use to produce regression results.
```{r, message = FALSE, warning = FALSE}
# Import population data
pop = read_xlsx("population_data.xlsx")

# Import sampling frame
sample = read_xlsx("sampling_frame.xlsx")

# Import final set of cases
dat = read_csv("regression_table.csv")
```


# 2. Descriptive Statistics

Next, we calculate descriptive statistics for each key concept in our dataset. These are represented in Appendix A in our manuscript.
```{r}
# These are the variable names we care about
order = c(
    # Dependent Variables
    "Equity (y)","Equity per GWh","Burden (100 scale)",
    # Values not weighted by community official preferences,
    # provided for comparison
    "Equity (non-weighted)","Burden (non-weighted)",
    # Technological Factors
    # Renewable Energy
    "RE GWh","Solar GWh","Wind GWh","Biomass GWh",         
    # Renewable Resources
    "PV_output_2018 (kWh/kWp/annum)","Windspeed_2018","Biomass Capacity Factor",
    # Raw Equity Factors
    "CO2 Offset per capita (Total)", "PM Offset per capita (Total)",
    "Savings per kWh (weighted average)", "FAT Revenue per capita (Total)",
    "Net Jobs per capita (Total)","RE kW unwanted per capita",
    # Survey Responses
    "Environmental_conservation", "Climate_change_countermeasures",
    "Pollution_countermeasures","Community_Development",
    "Community_Tax_Base","Electricity_Prices",
    "Employment", "Social_Equity",         
    "Disaster_Resilience","Fair_Labor_Conditions",
    # Survey Responses (Transformed)
    "Com_Tax_average","Env_Clim_average",
    # Demographic Data
    "Population_2010","Income_taxable_per_capita_2010",
    "Unemployment_2010","Age_median_2010","Ratio_Revs_Exp_2010")

# To generate descriptive statistics,
dat %>%
  # Let's pivot this dataset into tidy format.
  pivot_longer(cols = -c(Town, Prefecture, `Power Company`),
               names_to = "measure",
               values_to = "value") %>%
  # Now filter to just the variables we care about for our analysis
  filter(measure %in% order) %>%
  # Now for each measure,
  group_by(measure) %>%
  # We calculate these summary statistics
  summarize(`Median` = median(value, na.rm = TRUE),
            `Average` = mean(value, na.rm = TRUE),
            `Std. Dev.` = sd(value, na.rm = TRUE),
            `Min` = min(value, na.rm = TRUE),
            `Max` = max(value, na.rm = TRUE),
            `Cases` = n())

# This output is used to create Appendix A: Descriptive Statistics.
```
                              

# 3. Representativeness

Next, we would like to verify whether the towns which responded to our survey are representative of the population in terms of the traits that we care about. To do that, first, we must deal with missing data at the population level. We use multiple imputation via the **amelia** package to fill in data points for variables missing data for less than 5 percent of cases. We do this on the entire dataset, incorporating latent trends in this data to more accurately fill in missing data.

First, we calculate the logical bounds of our variables. These are the range of values that we allow missing data points to take on in the multiple imputation process. This helps avoid situations where we might fill in missing data for a scale from 0-100 with, say, 120, and impossible value.

## 3.1 Calculate Logical Bounds for Multiple Imputation
```{r}
bounds = pop %>%
  # pivot to tidy, long form so we can calculate summary statistic easily
  pivot_longer(cols = -c(Code, Town_Japanese, Prefecture_Town_Japanese, 
                         Prefecture_English, Prefecture_Town_English, Town_English),
               names_to = "measure",
               values_to = "value") %>%
  group_by(measure) %>%
  summarize(
    # Calculate percentage of missing data points in that column
    missing_percent = sum(if_else(is.na(value), 1, 0)) / n(),
    # Calculate minimum value for that column
    min = min(value, na.rm = TRUE),
    # calculate maximum value for that column
    max = max(value, na.rm = TRUE)) %>%
  # filter this list to just variables missing less than 5%
  filter(missing_percent < 0.05) %>%
  # create an id number for each measure
  mutate(column.number = 1:n()) %>% ungroup 
```

## 3.2 Multiple Imputation with Amelia
```{r, message = FALSE, warning = FALSE}
# Next, we use multiple imputation on the data
imp = pop %>%
  # selecting just the municipality code and columns identified in the logical bounds calculation,
  # conveniently in the same order as in the logical bounds dataframe 
  select(Code, bounds$measure) %>%
  amelia(m = 5, # calculate  
         idvars = "Code", # omit this variable in imputation
         # max.resample = 10000, # max number of times to resample
         seed = 12345, # unique string for replication
         bounds = bounds %>%
           # rename columns to format expected by amelia, and 
           # remove missing_percent column, which we no longer need
           select(-measure, -missing_percent) %>%
           select(column.number, lower.bound = min, upper.bound = max) %>% 
           # convert to matrix, which amelia expects
           as.matrix()) 

# How much was missing from each?
summary(imp)
```


## 3.3 Calculate Difference of Means across Imputations

```{r}
# Let's use one-sample t-tests to compare whether 
# our average sample average is statistically significantly different 
# from the population average for key demographic and technological variables 
# that we could collect at both levels.
# This allows us to tell how representative the survey sample data used later in the paper is.

# Calculate test statistic (difference of means) for each imputation
imp.stat = function(data){
  data %>%
    summarize(
      # Calculate test statistic
      SolarkW_statistic = t.test(
        dat$SolarkW, 
        mu = mean(data$SolarkW))$statistic,
      WindkW_statistic = t.test(
        dat$WindkW, 
        mu = mean(data$WindkW))$statistic,
      BiomasskW_statistic = t.test(
        dat$BiomasskW, 
        mu = mean(data$BiomasskW))$statistic,
      Population_2010_statistic = t.test(
        dat$Population_2010, 
        mu = mean(data$Population_2010))$statistic,
      Income_taxable_per_capita_2010_statistic = t.test(
        dat$Income_taxable_per_capita_2010, 
        mu = mean(data$Income_taxable_per_capita_2010))$statistic,
      Unemployment_2010_statistic = t.test(
        dat$Unemployment_2010, 
        mu = mean(data$Unemployment_2010))$statistic,
      Age_median_2010_statistic = t.test(
        dat$Age_median_2010, 
        mu = mean(data$Age_median_2010))$statistic,
      Ratio_Revs_Exp_2010_statistic = t.test(
        dat$Ratio_Revs_Exp_2010, 
        mu = mean(data$Ratio_Revs_Exp_2010))$statistic,
      Local_taxes_per_capita_2010_statistic = t.test(
        dat$Local_taxes_per_capita_2010, 
        mu = mean(data$Local_taxes_per_capita_2010))$statistic,
      # Calculate standard error
      SolarkW_se = t.test(
        dat$SolarkW, 
        mu = mean(data$SolarkW))$stderr,
      WindkW_se = t.test(
        dat$WindkW, 
        mu = mean(data$WindkW))$stderr,
      BiomasskW_se = t.test(
        dat$BiomasskW, 
        mu = mean(data$BiomasskW))$stderr,
      Population_2010_se = t.test(
        dat$Population_2010, 
        mu = mean(data$Population_2010))$stderr,
      Income_taxable_per_capita_2010_se = t.test(
        dat$Income_taxable_per_capita_2010, 
        mu = mean(data$Income_taxable_per_capita_2010))$stderr,
      Unemployment_2010_se = t.test(
        dat$Unemployment_2010, 
        mu = mean(data$Unemployment_2010))$stderr,
      Age_median_2010_se = t.test(
        dat$Age_median_2010, 
        mu = mean(data$Age_median_2010))$stderr,
      Ratio_Revs_Exp_2010_se = t.test(
        dat$Ratio_Revs_Exp_2010, 
        mu = mean(data$Ratio_Revs_Exp_2010))$stderr,
      Local_taxes_per_capita_2010_se = t.test(
        dat$Local_taxes_per_capita_2010, 
        mu = mean(data$Local_taxes_per_capita_2010))$stderr) %>%
      return()
}
 
# Collect difference of means and standard error for each imputed dataset
imp.result = imp$imputations %>%
  lapply(imp.stat) %>%
  bind_rows(.id = "imp") %>%
  # Pivot long ways
  pivot_longer(cols = -imp, names_to = "measure", values_to = "value") %>%
  # Classify each data point as a statistic or standard error
  mutate(type = str_extract(measure, "_se|_statistic") %>% str_remove("_"), 
         measure = str_remove(measure, "_se|_statistic")) %>%
  # Pivot wider into a format mi.meld can receive to average results
  pivot_wider(id_cols = measure,
              names_from = c(type, imp),
              values_from = value)


# To prepare for calculating p.values from averaged test statistics, 
# we are going to make a normal t-distribution
require(distributions3)
# calculate the distribution with 46 degrees of freedom
t46 = StudentsT(df = 46)

# Use Rubin's Rules to average the statistic,
# and merge the standard error within imputations and between imputations
mi.meld(q = imp.result %>%
          # grab means
          select(statistic_imp1, statistic_imp2,statistic_imp3,statistic_imp4, statistic_imp5), 
        se = imp.result %>%
          # grab standard deviations
          select(se_imp1,se_imp2,se_imp3,se_imp4,se_imp5),
        # Each column is a set of results from an imputed dataframe
        byrow = FALSE) %>%
  # bind results together
  bind_rows() %>%
  # Add variable names back in
  mutate(measure = imp.result$measure)  %>%
  # reorder and rename columns
  select(measure, stat = q.mi, se = se.mi) %>%
  # Now calculate
  mutate(p.value = 1 - cdf(t46, abs(stat)) + cdf(t46, -abs(stat))) %>%
  # Add asterisks
  mutate(significance = stars.pval(p.value))

```


```{r}
remove(imp.stat, imp.result, t46, order)
detach("package:distributions3", unload = TRUE)
```

This suggests that our sample differs from the population in a statistically significant way for the installed capacity of Wind and Biomass, and the median age. Fortunately, wind and biomass make up a tiny portion of renewables in Japan overall, and most of cities' equity scores are determined by solar anyways, because it constitutes such a large share of Japanese renewables. With these limitations in mind, we proceed with our analysis.

# 4. Sample Survey Descriptive Statistics

Next, we calculate descriptive statistics on results from our survey sample.
```{r}
dat %>% 
  # Select survey responses
  select(Town, Environmental_conservation, Climate_change_countermeasures,
         Pollution_countermeasures,Electricity_Prices,Community_Tax_Base,
         Employment,Fair_Labor_Conditions,Community_Development,Disaster_Resilience,Social_Equity) %>%
  # pivot to tidy format for summary analysis
  pivot_longer(cols = -Town,
               names_to = "measure",
               values_to = "value") %>%
  # For each item in the survey
  group_by(measure) %>%
  # Calculate the mean and standard deviation of their response
  summarize(
    mean = mean(value),
    sd = sd(value))
```


# 5. Modeling

Next, we model our variables. For each model, we display the summary table, and then run diagnostics. We display multicolinearity tests, namely results from the variance inflation factor (VIF) test. The gold standard for VIF is below 2.5, while extremely problematic values tend to be over 10. None of our models have VIFs over 3; most are at or under 2.5. As a result, we proceed with confidence that multicolinearity is not a problem in our models.

## 5.1 Model 1: Effect of Technology on Equity

```{r}
# Model 1: Effect of Technology on Equity ------------

equity = dat %>% 
  # rescale variables so that our beta estimates are standardized
  select(`Equity (y)`, `Solar GWh`, `Wind GWh`, `Biomass GWh`,
               `PV_output_2018 (kWh/kWp/annum)`, Windspeed_2018) %>%
  scale() %>% as.data.frame() %>%
  # run model
  lm(formula = `Equity (y)` ~ `Solar GWh` + `Wind GWh` + `Biomass GWh` + 
               `PV_output_2018 (kWh/kWp/annum)` + Windspeed_2018)

summary(equity)
```

```{r}
# What is the average VIF for this model?
mean(vif(equity))
```

```{r}
# Report individual VIF results
vif(equity) # variance inflation factors
```

Next, we evaluate heteroskedasticity. The plots below indicate some heteroskedasticity, but multiple regression tends to be quite robuts against it. As a result, we also plot residuals to evaluate whether they are normally distributed or are experiencing autocorrelation of some sort. We see little evidence of this.

```{r}
# Diagnostic Plots
plot(equity)
```

```{r}
# Plot residuals
ggplot(mapping = aes(x = equity$residuals)) +
  geom_histogram(fill = "steelblue", color = "white") +
  labs(title = "Distribution of Residuals", x = "Residuals")
# Pretty normally distributed; just limited by small number of cases.
```


## 5.2 Models 2-3: Interaction Effects on Equity

```{r}
equity2 <- dat %>%
  # rescale variables so that our beta estimates are standardized
  select(`Equity (y)`, `Solar GWh`, `Wind GWh`, `Biomass GWh`,
               `PV_output_2018 (kWh/kWp/annum)`, Windspeed_2018) %>%
  scale() %>% as.data.frame() %>%
  # apply interaction efect for Solar and Solar Resources
  lm(formula = `Equity (y)` ~ `Solar GWh` + `Wind GWh` + `Biomass GWh` + 
                `PV_output_2018 (kWh/kWp/annum)` + Windspeed_2018 + 
                `Solar GWh`*`PV_output_2018 (kWh/kWp/annum)`)

summary(equity2)
```

```{r}
# Calculate average variance inflation factor
mean(vif(equity2))
# Variance inflation factor is *always* higher when you use interaction effects, 
# because both a variable and its interaction appear in the model. 
# Since these are obviously related sets of numbers, they are highly correlated.
# As a result, VIF is not hugely useful here.
```

```{r}
vif(equity2)
```


```{r}
equity3 <- dat %>%
  # rescale variables so that our beta estimates are standardized
  select(`Equity (y)`, `Solar GWh`, `Wind GWh`, `Biomass GWh`,
               `PV_output_2018 (kWh/kWp/annum)`, Windspeed_2018) %>%
  scale() %>% as.data.frame() %>%
  # apply interaction efect for Solar and Solar Resources
  lm(formula = `Equity (y)` ~ `Solar GWh` + `Wind GWh` + `Biomass GWh` + 
                `PV_output_2018 (kWh/kWp/annum)` + Windspeed_2018 + 
                `Wind GWh`*Windspeed_2018)
summary(equity3)
```

```{r}
mean(vif(equity3))
# In contrast, this VIF is just absurd, 
# and it is entirely due to the fact that the variables 
# Wind, Windspeed, and their interaction effect are highly correlated.
# This is why interaction effects are useful for further exploring variables,
# but our main results depend on our primary equity model.
```

```{r}
vif(equity3)
```

# 5.3 Transformations of Model 1

```{r}
# These transformations reduces each reduce heteroskedasticity.
# Reciprocal transformation appears to improve it.
dat %>%
  mutate(
    original = `Equity (y)`,
    reciprocal = (1 / `Equity (y)`),
    sqrt = `Equity (y)` %>% sqrt,
    log = `Equity (y)` %>% log) %>%
  select(original, reciprocal, sqrt, log)
```


```{r}
# Next, we transformed equity to adjust for heteroskedasticity.
# which likely led to our non-constant variance (heteroskedasticity)
equity_transformed = dat %>%

  mutate(`Equity (y)` = (1 / `Equity (y)`)  ) %>%
  # rescale variables so that our beta estimates are standardized
  select(`Equity (y)`, `Solar GWh`, `Wind GWh`, `Biomass GWh`,
               `PV_output_2018 (kWh/kWp/annum)`, Windspeed_2018) %>%
  scale() %>% as.data.frame() %>%
  # run model
  lm(formula = `Equity (y)` ~ `Solar GWh` + `Wind GWh` + `Biomass GWh` + 
               `PV_output_2018 (kWh/kWp/annum)` + Windspeed_2018)
summary(equity_transformed)
```

```{r}
# But our residuals STILL appear non-normal - so that's not ideal!
ggplot(mapping = aes(x = equity_transformed$residuals)) +
  geom_histogram(fill = "steelblue", color = "white") +
  labs(x = "Residuals (Reciprocal Transformation)")

# Transforming our results means dramatically changing the meaning of our dependent variable - not ideal - with little evidence that it has improved our distribution of residuals.
# It also removes the effect of a few strong outliers on the data in a dataset that is already small,
# when that is actually an important effect.
# Further removing cases cases jeopardizes the validity of the sample.
# As a result, we invite the readers to read the original results with a cautious eye, understanding that they are shaped by outliers.
```


# 5.4.1 Effect of Community Preferences on Equity Outcomes

```{r}
equity_preference = dat %>%

  # rescale variables so that our beta estimates are standardized
  select(`Equity (y)`, Env_Clim_average, Pollution_countermeasures, 
              Electricity_Prices, Com_Tax_average, Employment) %>%
  scale() %>% as.data.frame() %>%
  lm(formula = `Equity (y)` ~ Env_Clim_average + Pollution_countermeasures + 
              Electricity_Prices + Com_Tax_average + Employment)

summary(equity_preference)
```

```{r}
# Great VIF results - not multicolinearity
mean(vif(equity_preference))
```

```{r}
vif(equity_preference)
```

```{r}
plot(equity_preference)
# Surprisingly alright, given sample size. One outlier (# 9)
```
# 5.4.2 Effect of Community Preferences on unweighted Equity Outcomes

```{r}
equity_preference_unweighted = dat %>%
  # rescale variables so that our beta estimates are standardized
  select(`Equity (non-weighted)`, Env_Clim_average, Pollution_countermeasures, 
              Electricity_Prices, Com_Tax_average, Employment) %>%
  scale() %>% as.data.frame() %>%
  lm(formula = `Equity (non-weighted)` ~ Env_Clim_average +
       Pollution_countermeasures + 
              Electricity_Prices + Com_Tax_average + Employment)


summary(equity_preference)
```

```{r}
# VIF
vif(equity_preference_unweighted)
```

# 5.5 Effect of Community Preferences on Change in Equity Outcomes

```{r}
change_equity_preference = dat %>%

  # rescale variables so that our beta estimates are standardized
  select(`Equity_change`, Env_Clim_average, Pollution_countermeasures, 
              Electricity_Prices, Com_Tax_average, Employment) %>%
  scale() %>% as.data.frame() %>%
  lm(formula = `Equity_change` ~ Env_Clim_average + Pollution_countermeasures + 
              Electricity_Prices + Com_Tax_average + Employment)
summary(change_equity_preference)

```

## Multicolinearity diagnostics
```{r}
vif(change_equity_preference)
# Excellent Results
```

## Heteroskedasticity diagnostics
```{r}
plot(change_equity_preference)
# Remarkably good considering the sample size.
```

# 5.6 Effect of Equity factors on Equity & Equity per GWh
```{r}
# (or, why we had to use pearson product moment correlation coefficents instead)

epg.tech <- dat %>%
  select(`Equity (y)`, `CO2 Offset (weighted)`, `PM Offset (weighted)`,
                 `Electricity Prices (weighted)`, `Community Tax (weighted)`,
                 `Employment (weighted)`, `Satisfaction (weighted)`) %>%
  scale() %>% as.data.frame() %>%
  lm(formula = `Equity (y)` ~ `CO2 Offset (weighted)` + `PM Offset (weighted)` +
                 `Electricity Prices (weighted)` + `Community Tax (weighted)`+
                 `Employment (weighted)` + `Satisfaction (weighted)`)

vif(epg.tech) # can't use because of multicolinearity
```

```{r}
epg.tech <- dat %>%
  select(`Equity per GWh`, `CO2 Offset (weighted)`, `PM Offset (weighted)`,
                 `Electricity Prices (weighted)`, `Community Tax (weighted)`,
                 `Employment (weighted)`, `Satisfaction (weighted)`) %>%
  scale() %>% as.data.frame() %>%
  lm(formula = `Equity per GWh` ~ `CO2 Offset (weighted)` + `PM Offset (weighted)` +
                 `Electricity Prices (weighted)` + `Community Tax (weighted)`+
                 `Employment (weighted)` + `Satisfaction (weighted)`)
# can't use because of multicolinearity
vif(epg.tech)
```

6. Effect of Demographics on Equity per GWh

```{r}
# Finally, we test whether social and demographic measures seem related to Equity per GWh.
# Put in other words, do certain kinds of communities end up with higher equity potential than others?

# First, we do an exploratory analysis with correlations.

# Let's pivot the data into a longer format for analysis
more_longer = dat %>%
  select(
     Town, Prefecture,
    `Equity per GWh`, `Equity (y)`, 
    Income_taxable_per_capita_2010, Unemployment_2010,
         Ratio_Revs_Exp_2010, Population_2010, Age_median_2010,
    PercentTotal_Migration_2010,
    `Percent_University Educated_2010`) %>%
  # To make several variables easier to read,
  # let's divide by millions
  mutate(Population_2010 = Population_2010 / 1000,
         Income_taxable_per_capita_2010 = Income_taxable_per_capita_2010 / 1000) %>%
 pivot_longer(
    # Holding Town ID and the equity outcomes as indexes
    cols = -c(`Town`, `Prefecture`, `Equity (y)`, `Equity per GWh`),
    names_to = "measure",
    values_to = "value") 

# Which variables are correlated with Equity per GWh? Or Equity, for that matter?
more_longer %>%
  group_by(`Measure` = measure) %>%
  summarize(
    # Calculate Pearson's r for its correlation with Equity
    `Correlation: Equity` = cor(
      `Equity (y)`, value, 
      use = "complete.obs", method = "pearson") %>% 
      # rounded to 3 digits
      round(3),
    # Calculate Pearson's r for its correlation with Equity per GWh
    `Correlation: Equity per GWh` = cor(
      `Equity per GWh`, value, 
      use = "complete.obs", method = "pearson") %>% 
      # rounded to 3 digits
      round(3))

# Looks like Median Age (r = -0.200), 
# Total Migration per capita (r = 0.362), and 
# Taxable Income per capita (r = 0.509) 
# have the strongest associations with Equity per GWh
```


```{r}
# Let's put these in a model!

# We want to test associations between these variables
formula = `Equity per GWh` ~ Population_2010 + 
       Income_taxable_per_capita_2010 + 
       Unemployment_2010 + Age_median_2010 + Ratio_Revs_Exp_2010 + 
       PercentTotal_Migration_2010 + `Percent_University Educated_2010`

# Run the model
epg.dem <- dat %>%
  # Let's log-transform Equity per GWh, which varies greatly a scale that can be assessed with more constant variance
  # Problem: Equity per GWh has positive AND negative values, and log transforming will turn our negative values to NA.
  # To deal with this, we are going to adjust Equity per GWh 
  # by adding a constant, 1, to all negatives to boost all values into positive values,
  # and then calculate the log of this.
  mutate(`Equity per GWh` = log(1 + `Equity per GWh`)) %>%
  # Now, rescale all variables to be mean-centered at 0 (Z-scores) 
  # so we can get standardized effects
    select(`Equity per GWh`,  Income_taxable_per_capita_2010, Unemployment_2010,
         Ratio_Revs_Exp_2010, Population_2010, Age_median_2010, 
  `Percent_University Educated_2010`,
         PercentTotal_Migration_2010) %>%
  scale() %>% as.data.frame() %>%
  # Run model
  lm(formula = formula)

summary(epg.dem)

# We get a statistically significant model, and several statistically significant results.
# These indicate that as Age increases by 1 standard deviation, equity per GWh increases by exp(0.318) standard deviations.
# These also indicate that as Education increases by 1 standard deviation, equity per GWh increases by exp(0.590) standard deviations
```
```{r}
mean(vif(epg.dem))
```

```{r, warning = FALSE, message = FALSE, fig.height=9, fig.width=6}
more_longer %>%
  # Recode names to be more readable
  mutate(measure = dplyr::recode_factor(measure,
    "Population_2010" = "Population (thousands) (2010)",
    "Income_taxable_per_capita_2010" = "Income per capita (thousands) (2010)",
    "Unemployment_2010" = "Unemployment Rate (2010)",
    "Age_median_2010" = "Median Age (2010)",
    "Ratio_Revs_Exp_2010" = "Revenues vs. Expenditures (2010)",
    "PercentTotal_Migration_2010" = "(%) Total Migration (2010)",
    "Percent_University Educated_2010" = "(%) University Educated (2010)")) %>%
  # Display only the terms used in the model
  filter(measure %in% c("Population (thousands) (2010)", "Income per capita (thousands) (2010)", 
                        "Unemployment Rate (2010)", "Median Age (2010)", 
                        "Revenues vs. Expenditures (2010)", "(%) Total Migration (2010)",
                        "(%) University Educated (2010)")) %>%
  # Map aesthetics, transforming Equity per GWh in the same way we did in the model
  ggplot(mapping = aes(
    x = value, 
    y = log(1 + `Equity per GWh`),
    # size the points by the actual Equity per GWh outcome
    size = `Equity per GWh`)) +
    # Create a Loess-smoothed line of best fit
   geom_smooth(color = "darkgray") +
  # Map points with transparency
  geom_point(color = "steelblue", alpha = 0.75) + 
  # Split into 7 different plots for each predictor variable
  facet_wrap(~measure, scales = "free", nrow = 4, ncol = 2) +
  # set theme
  theme_minimal() +
  # Put legend on the bottom and center the title
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5)) +
  # Get rid of the color legend
  guides(color = FALSE) +
  # Label axes
  labs(y = "Equity per Gigawatt-hour (log-transformed)",
       x = "",
       size = "Equity per Gigawatt-hour (actual)",
       title = "Social Drivers of Equity per GWh",
       caption = "Independent relationships of social factors with equity potential. Lines depict Loess-smoothed trend.")

# Great. So weve found some independent justification for our model results to. 
# There does seem to be a relationship between equity outcomes and the predictors age and total migration.
```

However, this is a very small sample. The fact that we garnered statistically significant results is great, but we would feel better if we could use another method that retains validity despite small sample sizes. For this reason, we turn to permutation tests.

Permutation tests create 1000 different versions of the dataset, where everything remains the same, except the outcome variable gets randomly shuffled. This allows us to see what statistics we would get if the observed outcomes were assigned by chance. The more permutations, the higher detail we can use to justify whether our results really are statistically significant.

```{r}
# First, we can gather 1000 permutations of the dataset

perms <- dat %>%
  # Transform outcome to decrease heteroskedasticity
  mutate(`Equity per GWh` = log(`Equity per GWh` + 1)) %>%
  # Select variables for model
  select(`Equity per GWh`,   Income_taxable_per_capita_2010, Unemployment_2010,
         Ratio_Revs_Exp_2010, Population_2010, Age_median_2010, 
         `Percent_University Educated_2010`,
         PercentTotal_Migration_2010) %>%
  # rescale variables
  scale() %>% as.data.frame() %>%
  # Permute 1000 times, shuffling the Equity per GWh variable
  permute(n = 1000, `Equity per GWh`)
```

```{r}
# Second, we run models on these 1000 permuted datasets
models <- map(perms$perm, ~ lm(formula = formula, data = .))
remove(perms)
```

```{r}
# Third, we collect model-level statistics from each model using the glance function
null.model <- map_df(models, broom::glance, .id = "id")

# Fourth, we collect the coefficient tables for each model using the tidy function
null.coef <- map_df(models, broom::tidy, .id = "id")
```

```{r}
# Was the observed model statistic (F-statistic) more extreme than the statistics generated by chance?
# Evaluate if it was greater than or not (TRUE/FALSE == 1/0)
# Add together, and divide by total
mean(null.model$statistic > broom::glance(epg.dem)$statistic)
```



```{r}
null.coef %>% 
  # Grab the estimate
  select(id, term, estimate) %>%
  # Join in the observed estimate and std. error
  left_join(by = "term",
            y = epg.dem %>% 
              tidy() %>% 
              select(term, 
                     estimate.obs = estimate, 
                     std.error.obs = std.error)) %>%
  # For each term,
  group_by(term) %>%
  summarize(
    # Get the original observed estimate
    estimate.perm = mean(estimate.obs),
    # and calculate the percentile
    # how many times out of the total that the observed was more extreme than the null
    percentile = mean(estimate > mean(estimate.obs))
  ) %>%
  # Now convert that into a p.value by...
  # Getting how far it is from the middle
  mutate(
    estimate.perm = estimate.perm %>% round(3),
    p.value = if_else(
    percentile > 0.50, 
    1 - percentile, 
    percentile))
```

Good news; pretty solid verification of results via permutation tests as well, which tend to be more robust to small-sample sizes.


```{r}
remove(null.coef,
       null.model,
       models, more_longer)



```



7. Bivariate correlations to find pearson product moment correlation coefficients

Because we couldn't assess the effect of equity components on the equity scores we created due to multicolinearity, instead, we use pearson's r, the correlation coefficient, to describe this.
```{r}

dat %>%
  # Select equity outcomes and equity components (raw measures used to build equity outcomes)
  select(`Town`, `Equity (y)`, `Equity per GWh`,
         `CO2 Offset per capita (Total)`,`PM Offset per capita (Total)`,
         `Savings per kWh (weighted average)`,`FAT Revenue per capita (Total)`,
         `Net Jobs per capita (Total)`,`RE kW unwanted per capita`) %>%
  # Now pivot longer into a tidy format
  pivot_longer(
    # Holding Town ID and the equity outcomes as indexes
    cols = -c(`Town`, `Equity (y)`, `Equity per GWh`),
    names_to = "measure",
    values_to = "value") %>%
  # Now for each equity component
  group_by(`Measure` = measure) %>%
  summarize(
    # Calculate Pearson's r for its correlation with Equity
    `Correlation: Equity` = cor(
      `Equity (y)`, value, 
      use = "complete.obs", method = "pearson") %>% 
      # rounded to 3 digits
      round(3),
    # Calculate Pearson's r for its correlation with Equity per GWh
    `Correlation: Equity per GWh` = cor(
      `Equity per GWh`, value, 
      use = "complete.obs", method = "pearson") %>% 
      # rounded to 3 digits
      round(3))
    
```




